Denoising Diffusion Probabilistic Models (DDPMs) are emerging in text-to-speech (TTS) synthesis because of their strong capability to generate high-fidelity samples. However, their iterative refinement process in high-dimensional data space results in slow inference, which restricts their application in real-time systems. Previous works have sped up inference by reducing the number of inference steps, but at the cost of sample quality. In this work, to improve the inference speed of DDPM-based TTS models while achieving high sample quality, we propose ResGrad, a lightweight diffusion model that learns to refine the output spectrogram of an existing TTS model (e.g., FastSpeech 2) by predicting the residual between the model output and the corresponding ground-truth speech. ResGrad has several advantages: 1) Compared with other acceleration methods for DDPMs, which need to synthesize speech from scratch, ResGrad reduces the complexity of the task by changing the generation target from the ground-truth mel-spectrogram to the residual, resulting in a more lightweight model and thus a smaller real-time factor. 2) ResGrad is employed in the inference process of the existing TTS model in a plug-and-play way, without re-training this model. We verify ResGrad on the single-speaker dataset LJSpeech and two more challenging datasets, one with multiple speakers (LibriTTS) and one with a high sampling rate (VCTK). Experimental results show that, in comparison with other speed-up methods for DDPMs: 1) ResGrad achieves better sample quality at the same inference speed, measured by real-time factor; 2) at similar speech quality, ResGrad synthesizes speech more than 10 times faster than baseline methods. Audio samples are available at https://resgrad1.github.io/.
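The residual formulation in the abstract above can be illustrated with a minimal sketch. It covers only the arithmetic of the training target and the refinement step. The diffusion model that predicts the residual is not shown, and the function names and the list-of-frames layout are assumptions made for illustration, not the authors' code.

```python
def residual_target(tts_mel, ground_truth_mel):
    """Training target: ground-truth mel-spectrogram minus the TTS model output.

    Both inputs are lists of frames, each frame a list of mel-bin values
    (a hypothetical layout chosen for this sketch).
    """
    return [
        [truth - predicted for predicted, truth in zip(pred_frame, truth_frame)]
        for pred_frame, truth_frame in zip(tts_mel, ground_truth_mel)
    ]


def refine(tts_mel, predicted_residual):
    """Inference: add the predicted residual to the existing TTS output."""
    return [
        [predicted + residual for predicted, residual in zip(pred_frame, res_frame)]
        for pred_frame, res_frame in zip(tts_mel, predicted_residual)
    ]
```

With a perfect residual prediction, `refine` recovers the ground-truth spectrogram exactly. In practice the residual would come from the lightweight diffusion model, while the existing TTS model stays unchanged.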
translated by 谷歌翻译
Recently, neural networks have proven their impressive ability to solve partial differential equations (PDEs). Among them, the Fourier neural operator (FNO) has shown success in learning solution operators for highly non-linear problems such as turbulent flow. FNO is discretization-invariant: it can be trained on low-resolution data and generalize to high-resolution problems. This property is related to the low-pass filters in FNO, where only a limited number of frequency modes are selected to propagate information. However, selecting an appropriate number of frequency modes and an appropriate training resolution for different PDEs remains a challenge. Too few frequency modes and low-resolution data hurt generalization, while too many frequency modes and high-resolution data are computationally expensive and lead to over-fitting. To this end, we propose the Incremental Fourier Neural Operator (IFNO), which augments both the frequency modes and the data resolution incrementally during training. We show that IFNO achieves better generalization (around 15% reduction in testing L2 loss) while reducing the computational cost by 35%, compared to the standard FNO. In addition, we observe that IFNO follows the behavior of implicit regularization in FNO, which explains its excellent generalization ability.
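The low-pass behaviour mentioned in the abstract above can be sketched in one dimension: transform a signal, keep only the lowest frequency modes, and transform back. This is not an FNO layer, since a real layer also applies learned weights to the retained modes. The `mode_schedule` helper is a hypothetical fixed schedule, an assumption standing in for whatever criterion IFNO uses to grow the number of modes during training.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform (O(n^2), fine for a small example)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def low_pass(signal, num_modes):
    """Zero every mode whose frequency index is num_modes or higher."""
    n = len(signal)
    spectrum = dft(signal)
    # Index k and index n - k hold the same frequency with opposite sign.
    kept = [
        coeff if min(k, n - k) < num_modes else 0.0
        for k, coeff in enumerate(spectrum)
    ]
    return [value.real for value in idft(kept)]


def mode_schedule(start_modes, max_modes, step, num_stages):
    """Hypothetical schedule: add `step` modes per training stage, up to a cap."""
    return [min(start_modes + stage * step, max_modes) for stage in range(num_stages)]
```

Filtering a sum of a slow and a fast sine wave with a small `num_modes` returns only the slow component. Raising `num_modes` stage by stage lets progressively finer detail through.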
Facial expression recognition (FER) plays a significant role in many computer vision applications. We revisit this problem from a new perspective, asking whether useful representations that improve FER performance can be acquired in the image generation process, and propose a novel generative method for the FER task based on the image inversion mechanism, termed Inversion FER (IFER). In particular, we devise a novel Adversarial Style Inversion Transformer (ASIT) for IFER to comprehensively extract features of generated facial images. In addition, ASIT is equipped with an image inversion discriminator that measures the cosine similarity of semantic features between source and generated images, constrained by a distribution alignment loss. Finally, we introduce a feature modulation module to fuse the structural code and latent codes from ASIT for the subsequent FER task. We extensively evaluate ASIT on facial datasets such as FFHQ and CelebA-HQ, showing that our approach achieves state-of-the-art facial inversion performance. IFER also achieves competitive results on facial expression recognition datasets such as RAF-DB, SFEW and AffectNet. The code and models are available at https://github.com/Talented-Q/IFER-master.
Image super-resolution is a common task on mobile and IoT devices, where one often needs to upscale and enhance low-resolution images and video frames. While numerous solutions have been proposed for this problem in the past, they are usually not compatible with low-power mobile NPUs, which have many computational and memory constraints. In this Mobile AI challenge, we address this problem and ask the participants to design an efficient quantized image super-resolution solution that can run in real time on mobile NPUs. The participants were provided with the DIV2K dataset and trained INT8 models to perform high-quality 3X image upscaling. The runtime of all models was evaluated on the Synaptics VS680 Smart Home board with a dedicated edge NPU capable of accelerating quantized neural networks. All proposed solutions are fully compatible with the above NPU, running at up to 60 FPS when reconstructing Full HD resolution images. A detailed description of all models developed in the challenge is provided in this paper.
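Whole slide image (WSI) classification often relies on deep supervised multiple instance learning (MIL) methods to handle gigapixel-resolution images and slide-level labels. Yet the strong performance of deep learning comes from exploiting large datasets and diverse samples, which calls for efficient training pipelines that scale to large datasets and for data augmentation techniques that diversify samples. However, current MIL-based WSI classification pipelines are memory-expensive and computationally inefficient, since they usually assemble tens of thousands of patches as bags for computation. On the other hand, although data augmentations are popular in other tasks, they remain unexplored for WSI MIL frameworks. To address these issues, we propose Remix, a general and efficient framework for MIL-based WSI classification. It comprises two steps: reduce and mix. First, it reduces the number of instances in WSI bags by substituting instances with instance prototypes, i.e., patch cluster centroids. Then, we propose a "Mix-the-bag" augmentation that contains four online, stochastic and flexible latent space augmentations. It brings diverse and reliable class-identity-preserving semantic changes in the latent space while enforcing semantic-perturbation invariance. We evaluate Remix on two public datasets with two state-of-the-art MIL methods. In our experiments, consistent improvements in precision, accuracy and recall are achieved, with training time and memory consumption reduced by orders of magnitude, demonstrating the effectiveness and efficiency of Remix. Code is available.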
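The "reduce" step described above, replacing the instances in a bag with cluster centroids, can be sketched with a plain k-means loop. This is a minimal illustration under stated assumptions: the abstract does not say which clustering algorithm or initialisation is used, so Lloyd's algorithm with evenly spaced initial centroids is an assumption, and `reduce_bag` is a hypothetical name. The "mix" augmentations are not sketched because the abstract does not specify them.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def reduce_bag(bag, num_prototypes, num_iters=10):
    """Replace a bag of instance features with `num_prototypes` centroids."""
    if not 0 < num_prototypes <= len(bag):
        raise ValueError("num_prototypes must be between 1 and len(bag)")
    # Deterministic initialisation (an assumption): evenly spaced instances.
    stride = len(bag) // num_prototypes
    centroids = [list(bag[i * stride]) for i in range(num_prototypes)]
    for _ in range(num_iters):
        clusters = [[] for _ in range(num_prototypes)]
        for instance in bag:
            nearest = min(
                range(num_prototypes),
                key=lambda c: squared_distance(instance, centroids[c]),
            )
            clusters[nearest].append(instance)
        # An empty cluster keeps its previous centroid.
        centroids = [
            mean_vector(members) if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
    return centroids
```

A downstream MIL model would then consume the short list of prototypes instead of the full bag, which is where the memory and compute savings come from.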
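Clustering is a fundamental machine learning task that has been widely studied in the literature. Classic clustering methods assume that data are represented as features in a vectorized form through various representation learning techniques. As data become increasingly complex, shallow (traditional) clustering methods can no longer handle high-dimensional data types. With the huge success of deep learning, especially deep unsupervised learning, many representation learning techniques with deep architectures have been proposed in the past decade. Recently, the concept of deep clustering, i.e., jointly optimizing representation learning and clustering, has been proposed and has attracted growing attention in the community. Motivated by the tremendous success of deep learning in clustering, one of the most fundamental machine learning tasks, and by the many recent advances in this direction, we survey state-of-the-art deep clustering approaches. We summarize the essential components of deep clustering and categorize existing methods by the way they design the interaction between deep representation learning and clustering. Moreover, this survey also provides popular benchmark datasets, evaluation metrics and open-source implementations to clearly illustrate various experimental settings. Last but not least, we discuss practical applications of deep clustering and suggest challenging topics that deserve further investigation as future directions.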
Binaural audio plays a significant role in constructing immersive augmented and virtual realities. As it is expensive to record binaural audio from the real world, synthesizing it from mono audio has attracted increasing attention. This synthesis process involves not only the basic physical warping of the mono audio, but also room reverberations and head/ear related filtrations, which are difficult to simulate accurately in traditional digital signal processing. In this paper, we formulate the synthesis process from a different perspective by decomposing the binaural audio into a common part that is shared by the left and right channels and a specific part that differs in each channel. Accordingly, we propose BinauralGrad, a novel two-stage framework equipped with diffusion models to synthesize them respectively. Specifically, in the first stage, the common information of the binaural audio is generated with a single-channel diffusion model conditioned on the mono audio, based on which the binaural audio is generated by a two-channel diffusion model in the second stage. Combining this novel perspective of two-stage synthesis with advanced generative models (i.e., the diffusion models), the proposed BinauralGrad is able to generate accurate and high-fidelity binaural audio samples. Experimental results show that on a benchmark dataset, BinauralGrad outperforms the existing baselines by a large margin in terms of both objective and subjective evaluation metrics (Wave L2: 0.128 vs. 0.157, MOS: 3.80 vs. 3.61). The generated audio samples (https://speechresearch.github.io/binauralgrad) and code (https://github.com/microsoft/NeuralSpeech/tree/master/BinauralGrad) are available online.
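Detecting fraudulent transactions is an essential component of controlling risk in e-commerce marketplaces. Apart from the rule-based and machine learning filters that are already deployed in production, we want to enable efficient real-time inference with graph neural networks (GNNs), which is useful for catching multi-hop risk propagation in a transaction graph. However, two challenges arise when implementing GNNs in production. First, future information in a dynamic graph should not be considered in message passing when predicting the past. Second, the latency of graph queries and GNN model inference is usually up to hundreds of milliseconds, which is costly for some critical online services. To tackle these challenges, we propose a Batch and Real-time Inception GrapH Topology (BRIGHT) framework that conducts end-to-end GNN learning and allows efficient online real-time inference. The BRIGHT framework consists of a graph transformation module (Two-Stage Directed Graph) and a corresponding GNN architecture (Lambda Neural Network). The Two-Stage Directed Graph guarantees that the information passed through neighbors comes only from historical payment transactions. It consists of two subgraphs representing historical relationships and real-time links, respectively. The Lambda Neural Network decouples inference into two stages: batch inference of entity embeddings and real-time inference of transaction predictions. Our experiments show that BRIGHT outperforms the baseline models by >2% on average with respect to precision. Furthermore, BRIGHT is computationally efficient for real-time fraud detection. Regarding end-to-end performance (including neighbor query and inference), BRIGHT can reduce the P99 latency by >75%. For the inference stage, our speedup is on average 7.8× compared to a traditional GNN.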
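The first challenge above, that message passing must not use future information, can be illustrated with a simple time-respecting neighbor filter. This is a deliberately simplified sketch, not the Two-Stage Directed Graph construction itself. The data layout (a dictionary from transaction id to a timestamp and a set of entity ids) and the function name are assumptions made for illustration.

```python
def historical_neighbors(transactions, target_id):
    """Ids of strictly earlier transactions that share an entity with the target.

    `transactions` maps a transaction id to (timestamp, set_of_entity_ids);
    this layout is a hypothetical choice for the sketch.
    """
    target_time, target_entities = transactions[target_id]
    return sorted(
        other_id
        for other_id, (timestamp, entities) in transactions.items()
        if other_id != target_id
        and timestamp < target_time      # never look into the future
        and entities & target_entities   # linked through a shared entity
    )
```

Because only earlier transactions are returned, aggregating features over these neighbors cannot leak information from after the transaction being scored.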
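We propose a new learning framework that captures the tiered structure of many real-world user-interaction applications, where users can be divided into two groups based on their different tolerance of exploration risk and should be treated separately. In this setting, we simultaneously maintain two policies $\pi^{\text{O}}$ and $\pi^{\text{E}}$: $\pi^{\text{O}}$ ("O" for "online") interacts with the more risk-tolerant users from the first tier and minimizes regret by balancing exploration and exploitation as usual, while $\pi^{\text{E}}$ ("E" for "exploit") focuses exclusively on exploitation for the risk-averse users from the second tier, using the data collected so far. An important question is whether such a separation yields advantages over the standard online setting (i.e., $\pi^{\text{E}} = \pi^{\text{O}}$) for the risk-averse users. We consider the gap-independent and gap-dependent settings separately. For the former, we prove that the separation is indeed not beneficial from a minimax perspective. For the latter, we show that if Pessimistic Value Iteration is chosen as the exploitation algorithm to produce $\pi^{\text{E}}$, we can achieve constant regret for the risk-averse users, independent of the number of episodes $K$. This is in sharp contrast to the $\Omega(\log K)$ regret of any online RL algorithm in the same setting, while the regret of $\pi^{\text{O}}$ (almost) maintains its online regret optimality and does not need to be compromised for the success of $\pi^{\text{E}}$.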
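Deployment efficiency is an important criterion for many real-world applications of reinforcement learning (RL). Despite the community's increasing interest, the problem lacks a formal theoretical formulation. In this paper, we propose such a formulation for deployment-efficient RL (DE-RL) from an "optimization with constraints" perspective: we are interested in exploring an MDP and obtaining a near-optimal policy within minimal deployment complexity, where in each deployment the policy can sample a large batch of data. Using finite-horizon linear MDPs as a concrete structural model, we reveal the fundamental limits of deployment efficiency by establishing information-theoretic lower bounds, and we provide algorithms that achieve the optimal deployment efficiency. Moreover, our formulation of DE-RL is flexible and can serve as a building block for other practically relevant settings; we give "Safe DE-RL" and "Sample-Efficient DE-RL" as two examples, which may be worth future investigation.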